ABSTRACT

Evaluation design is discussed in terms of conditions that an adult
education intervention (product, practice) must meet to get Joint
Dissemination and Review Panel (JDRP) approval. (Effectiveness, the sole
criterion for JDRP approval, must be established by evaluation data
adequate to tie the project and desired impact together in a
cause-and-effect relationship.) Four conditions examined by the JDRP are
considered: (1) the evidence must be valid and reliable, (2) the effect
must be of sufficient magnitude and have educational importance, (3) it
should be possible to reproduce both the intervention and its effects at
other sites, and (4) project data must be believable and interpretable.
Discussed under educational significance are size of effect, importance
of the educational area, and cost of the intervention. Considerations for
replicability include setting, staff, participants, and components.
Topics under the final condition of believability and interpretability
include consistency of factual data in narrative and tables, completeness
of data, and objectivity maintained in gathering data. An evaluation
design checklist is appended. (YLB)

INTRODUCTION

The problem is that of proving the effectiveness of adult education interventions
(i.e., products or practices) to the Joint Dissemination and Review Panel (J.D.R.P.)
of the U.S. Department of Education.



In order to be endorsed by J.D.R.P. for the Department of Education, educational
interventions must be shown to have positive impacts on their recipients. Those
positive impacts may be educational, attitudinal or behavioral in nature.

Effectiveness is the sole criterion for approval by J.D.R.P. In order to establish
effectiveness there must be evaluation data adequate to tie the project and the
desired impact together in a cause-and-effect relationship. To get J.D.R.P.
approval an intervention must meet several conditions:

I. The evidence must be valid and reliable: Statistical Significance
II. The effect must be of sufficient magnitude and have educational
importance: Educational Significance
III. It should be possible to reproduce both the intervention and its
effects at other sites: Replicability
IV. Project data must be believable and interpretable: Believability
and Interpretability




I. STATISTICAL SIGNIFICANCE

The main idea in evaluating an exemplary program is to measure the intended
positive effect which was achieved because of the intervention and which was
not compromised because of side effects. The measure(s) used must be
statistically valid and reliable in order to establish statistical significance.

A. VALIDITY

1. Logical Measurement

A valid measure is one that relates specifically to a certain kind of
change. This change can be educational, attitudinal or behavioral and
each kind of change requires a different kind of test. The measure
selected must bear a logical relationship to the specific behavior being
examined. If measures are unrelated to the behaviors that a program
seeks to change, it is impossible to draw accurate conclusions about
program effectiveness. For example, a program developed to improve
reading skills cannot use in its evaluation measures of self-concept,
mathematics skills or attitudes towards education because these
measures do not have a direct relationship to reading skills.



2. Uncompromised by Side Effects

The effect or effects must also be shown to be uncompromised by side
effects. Some side effects which must be considered and rejected are:

a. Side effects from the experiment.
b. Other simultaneous innovations.
c. Changes in population.
d. Differences in growth factors.



All of these possibilities should be considered in interpreting evaluation results and,
as far as possible, reasons should be presented why whatever gains were observed
should be attributed to the treatment and not to such other influences as the side
effects just listed. It is possible to make provisions in the design of the evaluation
to negate many of these alternative explanations. For this reason it is highly
desirable to obtain the services of a professional evaluator at the very beginning of
a project so that a proper evaluation design may be created.



If a control or comparison group is used, every effort must be made to ensure that
its members are as similar as possible to those in the treatment group. Systematic
differences between groups in other factors such as urban or rural environment,
race, socioeconomic status, or sex may be related to school performance. If these
educationally relevant factors are not similar for both groups, it is difficult to
determine whether observed differences resulted from the intervention or
from differences in these other factors.

An Example of Convincing Evidence:

Performance in other areas can serve as one indication of change or consistency in
the student population. For example, a new reading program in the small town of
Andover, Massachusetts, had apparently produced a significant improvement in the
performance of the students on district-wide standardized reading tests. The
question was whether the effect might have been the result of an influx of higher
achieving students into the district. The evaluators stated that there had been no
perceptible change in the composition of the population over the previous three
years. To support this statement, they pointed out that the new program
emphasized reading comprehension, and there had been large gains on the
comprehension subtests. However, performance on word attack skills, emphasized
in both the old and new programs, remained about the same before and after the
intervention. It therefore appeared unlikely that the ability level of the students
had changed.

3. Compared to Change without the Intervention

There must also be some credible estimate of conditions that would be in effect
without the intervention through the use of control groups, comparison groups
or other appropriate standard such as a time series design.

The most severe barrier for adult education programs is that all members of the
target groups are allowed to participate in the innovative program to be
measured. If this condition is not compensated for, there is no basis on which to
measure differences in levels of achievement. Statistical compensation for this
and other conditions can be achieved through variations in the evaluation
designs.

Sound conclusions rest on three steps:

a) Measuring the change in participants.

b) Measuring the change in absence of the program.

c) Comparing the two changes.
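
The three steps above can be sketched numerically. The following is a minimal
illustration in Python, assuming hypothetical pre-test and post-test scores for
a participant group and a comparison group. The function name and all scores
are invented for illustration and do not come from any actual evaluation.

```python
from statistics import mean


def net_program_effect(part_pre, part_post, comp_pre, comp_post):
    """Return (participant change, change without program, difference)."""
    # a) Measure the change in participants.
    change_participants = mean(part_post) - mean(part_pre)
    # b) Measure the change in the absence of the program.
    change_without = mean(comp_post) - mean(comp_pre)
    # c) Compare the two changes.
    return change_participants, change_without, change_participants - change_without


# Hypothetical scores, for illustration only.
part_pre, part_post = [40, 45, 50, 55], [52, 58, 60, 66]
comp_pre, comp_post = [41, 44, 51, 54], [45, 47, 55, 57]

change_part, change_comp, net = net_program_effect(
    part_pre, part_post, comp_pre, comp_post
)
print(change_part, change_comp, net)
```

Only the third figure, the gain over and above what the comparison group
achieved, would be a candidate for attribution to the program, and then only if
the side effects listed earlier have been ruled out.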



VALID EVALUATION DESIGNS

Descriptions of various statistically valid evaluation designs, in descending order
of credibility, and the conditions under which they would be used follow:

1. Random Selection Design

This design is used if individuals can be randomly selected and assigned
to either the participant group or the non-participating control group.
The random selection and assignment of individuals to either group
assures the statistical equivalence of participants and non-participants.
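
As a rough illustration of random assignment, the sketch below shuffles a list
of eligible individuals and splits it in half. The helper name, the applicant
labels, and the equal split are assumptions made for this example; an actual
project would follow the procedure set by its evaluator.

```python
import random


def randomly_assign(individuals, seed=None):
    """Shuffle the individuals and split them into two groups of equal size."""
    rng = random.Random(seed)  # a fixed seed makes the assignment reproducible
    shuffled = list(individuals)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # First half becomes the participant group, the rest the control group.
    return shuffled[:half], shuffled[half:]


# Hypothetical pool of 20 eligible adults.
applicants = [f"applicant_{i}" for i in range(1, 21)]
participants, controls = randomly_assign(applicants, seed=1)
print(len(participants), len(controls))
```

Recording the seed alongside the resulting lists lets a reviewer confirm that
group membership was decided by chance rather than by staff judgment.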



2. Delayed Random Selection Design

There is no control group if everyone participates in the program. A
comparison group is formed by randomly scheduling some potential
participants to start the second cycle of a training program. The
selected participants will begin instruction following completion of
instruction for remaining participants. The outcomes of the group
receiving the treatment first can be compared to the outcomes of the
delayed group.



3. Varied Instruction Time Design

A second compromise to the basic control group design, when all
members of the target group participate in the program, consists of
scheduling individuals or groups of individuals to participate to varying
degrees. If two or more groups receive different amounts of instruction,
this is sufficient to define an instructional variable.

Random assignment of individuals to the various groups is essential to
assume equivalence. The instructional variable can be related to
individual change measures through the use of a statistical technique
known as correlation or regression analysis.
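
To show what relating the instructional variable to change measures might look
like, here is a minimal sketch that computes a Pearson correlation and a
least-squares slope. The hours of instruction and the gain scores are
hypothetical, and the function name is an assumption of this example.

```python
from math import sqrt


def correlation_and_slope(x, y):
    """Return the Pearson correlation of x and y and the slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy), sxy / sxx


# Hypothetical data: hours of instruction received and pre-to-post gain.
hours = [20, 20, 40, 40, 60, 60]
gains = [3, 5, 7, 9, 11, 13]

r, slope = correlation_and_slope(hours, gains)
print(round(r, 3), round(slope, 3))
```

With these invented figures the slope is 0.2, that is, one additional point of
gain for every five additional hours of instruction.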

4. Matched Groups Design

A third compromise to random assignment of individuals into treatment
and control groups is to compare intact (pre-existing) groups who are
similar in all relevant characteristics. The evaluator could seek to
compare two different educational communities, one participating in the
adult or vocational education program and one not participating in the
program. The matching of groups should be conducted on an individual
basis; that is, for each program participant a "twin" is matched from the
comparison group.

A variation of this design is to compare two groups participating in
different educational programs (e.g., one traditional versus one
innovative). In this case the incremental effectiveness is evaluated (i.e.,
the benefits accruing to participants over and above the benefits
accruing from another program).
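
The individual matching described above, in which each participant is paired
with a "twin", can be sketched as follows. This simplified example assumes
matching on a single pre-test score and uses a greedy nearest-score rule; both
are assumptions for illustration, since the text calls for similarity in all
relevant characteristics. All names and scores are hypothetical.

```python
def match_twins(participants, pool):
    """Pair each participant with the closest unused comparison-group member.

    Both arguments map a person's label to a pre-test score.
    """
    available = dict(pool)
    pairs = {}
    for name, score in participants.items():
        if not available:
            break  # comparison pool exhausted; remaining participants unmatched
        twin = min(available, key=lambda c: abs(available[c] - score))
        pairs[name] = twin
        del available[twin]  # each comparison member is used at most once
    return pairs


# Hypothetical pre-test scores.
participants = {"P1": 42, "P2": 55, "P3": 61}
pool = {"C1": 60, "C2": 41, "C3": 70, "C4": 54}

pairs = match_twins(participants, pool)
print(pairs)
```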

5. Intact Groups Design

This design involves the comparison of intact (pre-existing) groups, but
the comparison group may differ considerably from the treatment group
in one or more relevant characteristics. Statistical adjustments must be
made with respect to the sources of non-equivalence between groups.
Because of the complexity of making these adjustments they should be
made by a qualified statistician.

6. Delayed Intact Groups Design

This design is the same as the intact groups design except that the
comparison group eventually participates in the adult or vocational
education program, after evaluation activities are completed. The
comparison group is established by delaying the onset of the program for
one of the groups.

7. Varied Instruction Time Design with Intact Groups

This design is used whenever Varied Instruction Time Design is
appropriate but it is not possible to randomly assign individuals to
treatment groups and as a result the groups are not truly equivalent. The
estimate of the relationship between the instructional variable and the
performance of individuals would be statistically adjusted for all sources
of non-equivalence of the varying groups.

8. Selection Groups Design

This design is used when selection of members of the participant group
and of the comparison group is made on the basis of a single educational
criterion. Through a statistical technique known as "regression
discontinuity" the post-test performance of the groups may be compared.
Two post-test scores are statistically projected to represent the post-test
performance of two hypothetical individuals achieving the same
selection score. However, one has participated in the program and the
other has not. The difference between these two projected scores
reflects the effectiveness of the program.
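
The projection described above can be sketched as follows. This is a simplified
illustration that assumes a straight-line relationship between selection score
and post-test score within each group. The cutoff of 50, the scores, and the
function names are hypothetical, and a real analysis would be carried out by a
qualified statistician.

```python
def fit_line(x, y):
    """Least-squares line through the points; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def discontinuity_at_cutoff(sel_part, post_part, sel_comp, post_comp, cutoff):
    """Project both groups' post-test scores to the cutoff and subtract."""
    slope_p, intercept_p = fit_line(sel_part, post_part)
    slope_c, intercept_c = fit_line(sel_comp, post_comp)
    projected_participant = slope_p * cutoff + intercept_p
    projected_nonparticipant = slope_c * cutoff + intercept_c
    return projected_participant - projected_nonparticipant


# Hypothetical data: adults scoring below 50 on the selection test were admitted.
sel_part, post_part = [30, 35, 40, 45], [45, 50, 55, 60]
sel_comp, post_comp = [55, 60, 65, 70], [60, 65, 70, 75]

effect = discontinuity_at_cutoff(sel_part, post_part, sel_comp, post_comp, cutoff=50)
print(effect)
```

In this invented data set the two projected individuals with a selection score
of 50 differ by 10 post-test points, which is the estimated program effect.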

9. Norm-Referenced Design

This design is used when there is no comparison group and it is not
possible under any circumstances to locate one. If standardized tests are
used with nationally normed scores available, the pre- and post-test
scores of program participants can be compared to the performance of a
nationwide sample of individuals. It is especially important to provide
documentation of the initial status or the expected growth rate of the
participants in the absence of the intervention.

10. Time-Series Design

This design is used when a single program group is being evaluated in the
absence of any comparison group, including a national norm group. An
acceptable procedure is to examine the change of program participants
over multiple points in time, before, during, and after the beginning of
instruction.

When evaluation designs do not involve a comparison group, but the performance of
the treatment group is compared with some norm or expectation, it is especially
important to provide documentation of the initial status or the expected growth
rate of the participants in the absence of the intervention.

B. RELIABILITY 

A reliable measure is consistent in its measurement, time after time.
Few evaluators would unquestioningly accept the result of any single
small-scale study as adequate evidence of the success of any intervention
regardless of the level of statistical significance. There should always be
at least one replication or parallel study. If, for example, comparable
results are observed in two or more classrooms, or in two or more
successive years, or both, results become much more credible. This type
of consistency of finding not only helps to establish statistical
significance and intuitive credibility, it is also directly relevant to the
transportability criterion.

To summarize, a measure which possesses both validity and reliability as
defined above is statistically significant.

In addition a proposed intervention must also establish educational
significance.

II. EDUCATIONAL SIGNIFICANCE

Educational significance is not a matter of statistics but relies on judgment.
The Joint Dissemination and Review Panel (J.D.R.P.) considers the following
three factors in judging the educational significance of a program.

A. Size of the effect

B. Importance of the area of change

C. Reasonableness of cost

The J.D.R.P., in making a judgment, considers the first two factors together,
assessing whether or not there is a reasonable balance between them. The
chance that a small gain would be considered educationally significant is higher
in a broad or educationally important area than in a narrow or less important
area.

A. Size of the Effect

criterion is rate of growth that will produce a post-test percentile
standing that exceeds the present percentile standing by one standard
error of measurement.

B. Importance of the Area of Change

There is a parallel between the breadth of focus of educational
interventions and that of the measures used to assess their impact.
Standardized achievement test scores are the most widely known and
accepted measures for use in education evaluations. They have,
however, been justifiably criticized on several counts. From an
evaluator's viewpoint, the most relevant criticism is that they do not
measure what is being taught. On the other hand, most people concede
that such instruments do measure the ability to read and do arithmetic.

The ability of any test to measure change will be directly related to how
relevant the test items are to the content of the instruction.



But it is not necessary to use achievement tests to establish the
educational significance of an intervention. The most convincing
evidence of success in a dropout-prevention program, for example, would
simply be statistics showing a decrease in the number of students
dropping out. Similarly, change in adult and vocational education
programs might be measured in terms of job placements, starting
salaries, rates of advancement on the job, etc. Obviously achievement
tests could also be used.



C. Reasonableness of Cost

Another factor entering into the consideration of educational
significance is the matter of cost. Because resources are limited, more
people can be served by low-cost interventions than by high-cost
interventions. If they are available, cost-benefit figures should be
presented. Cost-benefit can be defined as the amount of money which
will be realized (i.e., received or saved), over a specific period of time,
because of the operation of the program, in relation to the amount of
money spent to operate the program.
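
The cost-benefit definition above amounts to a simple ratio. The sketch below
shows one way to express it; the function name and the dollar amounts are
hypothetical, and the text does not prescribe any particular threshold for a
reasonable ratio.

```python
def cost_benefit_ratio(money_realized, program_cost):
    """Money received or saved over a period, per dollar spent on the program."""
    if program_cost <= 0:
        raise ValueError("program_cost must be positive")
    return money_realized / program_cost


# Hypothetical figures for one program year.
ratio = cost_benefit_ratio(money_realized=150_000, program_cost=60_000)
print(ratio)
```

Here each dollar spent is associated with $2.50 received or saved over the
period chosen.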

To summarize, an educationally significant effect is one of nontrivial
magnitude, in a content area generally accepted as important, which can
be achieved at a reasonable cost.



III. REPLICABILITY

Statistical significance may reassure us that project results were no fluke, but
that still does not guarantee that the intervention will be effective when
replicated in other settings. In order to determine the likelihood that the same
products or practices, when used elsewhere, will produce results similar to those
obtained at the original site, the panel considers the following four factors:

A. Uniqueness of Project Setting 

The project setting should not be so unique that the project could not be 
replicated elsewhere. An intervention that works in an environment
seldom found elsewhere may be deriving its effectiveness solely from
that environment and without further evidence of replicability would not
be a good candidate for J.D.R.P. approval.

B. Project Staff

Although the J.D.R.P. is concerned about the likelihood that an
educational intervention will work equally well in another setting, few
evaluations are designed to prove this. The primary focus of the
evaluation is, as it should be, upon the effectiveness of the intervention
as it was carried out. But whenever it appears likely that one or more
rare individuals exerted an influence that typical school personnel could
not duplicate, the replicability of the intervention is in question. There
are various indicators of generalizability that can be provided without
going to the trouble and expense of conducting a replication study. One
technique is to involve more than one class and teacher in the original
project wherever possible. Also, it is more convincing to select teachers
randomly to implement a new approach than to use those who volunteer.
The need is to provide evidence that teachers who carried out the
intervention were not unusual, so that one could expect teachers
elsewhere to get similar results if they use the same products or
procedures. If the project involves other staff members (administrators,
project directors, or specialists), the same procedures should be
employed.

C. Participants

Similar considerations apply with regard to participants. The more there
are, the better. Choosing them at random provides a more convincing
case for replicability than does using volunteers. If this is not possible, it
is a good idea to collect any available evidence that will support claims
that those who participated were not different from potential
participants anywhere else, and that their performance was a typical
result of the intervention, rather than a unique response to it by unusual
participants.

D. Replicability of Essential Components

Some evidence must be presented that essential components have been
identified and that these can be replicated elsewhere. Some examples of
these components might be teacher training, parental involvement,
individualization of instruction, and commercially available curricula.

To summarize, setting, staff, participants, and essential components
should not be so unique that they could not be replicated elsewhere.

IV. BELIEVABILITY AND INTERPRETABILITY

A. Consistency of Factual Data in Narrative and Tables

One of the most telling signs of a flawed evaluation is the presence of
inconsistencies in the data. An obvious problem is lack of agreement
between numbers reported in the text and the tables, or among tables.
Another is inconsistencies in the calculations.

Lack of agreement between numbers could be the result of a
typographical error. It could be an attempt to gloss over disappointing
data. If fewer pupils were tested than were served, it could be the result
of planned sampling that was part of the evaluation design, or attrition
may have left a biased sample at post-test time. Errors in calculations
may simply be mistakes, or there may be an attempt to make a "right"
answer out of wrong data. Any errors, however, tend to detract from the
overall credibility of the submission.

B. Completeness of Data

Lack of complete data also precludes accurate interpretation of an
evaluation. Sometimes submissions omit important information such as
the names, forms, and levels of tests used; the testing times; the number
of students tested; or the number of students served.

Data may be presented on only some of the measures that were
administered. Failure to report all of the data can arouse the suspicion
that those not reported were somehow unfavorable. Whatever the
reason, if information is missing, the evidence cannot be properly
interpreted or taken at its face value.



In addition to basic information about the data that were collected, there
should be a complete and accurate presentation of the analysis of these
data. Types of scores should be clearly identified, e.g., raw scores,
publishers' standard scores, Normal Curve Equivalents (NCEs), etc.
Summary statistics should include both "means" and "standard
deviations." Each time scores are reported, sample sizes should also be
reported. When statistical tests are used, they should be clearly
identified; the rationale for their use, if not obvious, should be presented;
and any assumptions made should be explained.

A major defect in some evaluation designs, particularly norm-referenced
designs, is the "regression effect" error. For example, if participants
were selected for a remedial reading project on the basis of their low
scores on the XYZ Reading Test, and if those same scores were then used
to figure the average pre-project status of the students, the gains
attributable to the intervention would be overestimated. Unless the
evaluation report clearly states that different tests were used for
selection and for pretesting, the reviewer has no assurance that this
error was avoided. It is important to specify clearly that the scores used
for selection of participants were not the same as those used for
measurement of pre-intervention status. They must not be the same.
(Scores from an alternate form of the same test, however, are perfectly
acceptable.)
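
The regression effect can be demonstrated with a small simulation. The sketch
below assumes a hypothetical population in which each observed score is a
stable ability plus independent measurement error, and in which no instruction
takes place at all. The distributions, sample size, and function name are
illustrative assumptions, not properties of any real test.

```python
import random
from statistics import mean


def simulate_regression_effect(n_students=2000, fraction_selected=0.2, seed=0):
    """Select low scorers on one testing, then look at a second testing."""
    rng = random.Random(seed)
    ability = [rng.gauss(50, 10) for _ in range(n_students)]
    # Two independent testings of the same students, with no program between.
    selection_scores = [a + rng.gauss(0, 10) for a in ability]
    pretest_scores = [a + rng.gauss(0, 10) for a in ability]
    # Admit the lowest scorers on the selection testing.
    order = sorted(range(n_students), key=lambda i: selection_scores[i])
    selected = order[: int(n_students * fraction_selected)]
    mean_selection = mean(selection_scores[i] for i in selected)
    mean_pretest = mean(pretest_scores[i] for i in selected)
    return mean_selection, mean_pretest


mean_selection, mean_pretest = simulate_regression_effect()
spurious_gain = mean_pretest - mean_selection
print(spurious_gain > 0)
```

The selected group scores higher on the second testing even though nothing was
taught. If the selection scores were used as the pre-project baseline, this
difference would be counted as program gain.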




C. Objectivity Maintained in Gathering Data

Another important point to stress in the presentation of evaluation
results is the objectivity of the data. Wherever it is possible for data to
be contaminated, the write-up should describe measures taken to make
sure that this did not occur. For example, a common source of problems
is the procedure followed in testing. When tests are administered by
persons with a stake in the outcome, such as staff members of a
project or those with a close personal relationship to the subjects, those
test results are suspect. The belief is that the test administrators could
have influenced the students' performance in some intangible and perhaps
totally unintentional way. They may have given extra directions, allowed
more time, or deviated in some other way from the instructions for test
administration. If the test required judgments or ratings by the
administrator, their objectivity would be seriously in doubt.

To make it clear that there were no irregularities in testing, an
evaluation should specify who gave the tests, and under what conditions.
For example, "Each participating class was given the XYZ Reading Test
on May 1 (the same date that the national norm group was tested). The
test was administered by teachers, who had been thoroughly trained in
the publisher's instructions for proper use of the tests. Although it was
not possible to obtain outside test administrators, the teachers were
rotated so that they tested each other's classes, not their own."



To summarize, in order to meet the tests of believability and
interpretability, the data pertaining to a project must be consistent,
complete, and objective.

The standards of statistical significance, educational significance,
replicability and believability are used by the Joint Dissemination and
Review Panel (J.D.R.P.) to judge the effectiveness of projects. Projects
meeting these standards will be recognized by the U.S. Department of
Education as worthy of adoption.

EVALUATION DESIGN CHECKLIST

OBJECTIVES / PROCEDURES / EVALUATION DESIGN / MEASURES

1. What objectives are planned as a result of the intervention?

2. Is the project methodology adequately described?

3. Will the procedures be consistently employed by all staff?

4. What kind of change is intended?
a. educational?
b. attitudinal?
c. behavioral?

5. Will the change be logically measured by relevant pre-testing and
post-testing?

6. Will precautions be taken in the design of the evaluation to
neutralize outside influences such as side effects from the experiment
or maturation?

7. Will the pre-test instrument be different from the instrument used
for selection?

CONDITIONS / STATISTICS / RELIABILITY

8. How will the estimate of conditions without the intervention be
measured?
a. control group?
b. comparison group?
c. other standard?

9. Will the comparison group be reasonably like the treatment group or
will it be matched statistically?

10. Will the testing conditions be the same for both groups?

11. Will the statistical analyses be appropriate to the evaluation design?

12. Is more than one observation planned to strengthen the case for
reliability?

EDUCATIONAL SIGNIFICANCE / REPLICABILITY / BELIEVABILITY

13. How will chance be ruled out as a possible cause of the change?

14. Is the change expected to be educationally significant as related to:
a. Size of the effect?
b. Importance of the area of change?
c. Reasonableness of cost?

15. Will it be possible to replicate the project in another location?
a. Is the setting neutral in effect?
b. Can staff substitutes be found elsewhere?
c. Are the participants typical enough to keep the project unaffected?
d. Can essential components such as curriculum or teacher training
courses be easily replicated?

16. Will the evidence presented be believable and interpretable?
a. Will the statistics in text and tables be consistent with each other?
b. Will the data be complete?
c. Will objectivity be preserved in gathering the data?